Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
Abstract
Proof. For the base case t = H+1, since $V_{DR}^{0} = V(s_{H+1}) = 0$, it is obvious that at the (H+1)-th step the estimator is unbiased with 0 variance, and the theorem holds. For the inductive step, suppose the theorem holds for step t+1. At time step t, we have (the first equality uses the unbiasedness of the estimator, $\mathbb{E}_t[V_{DR}^{H+1-t}] = \mathbb{E}_t[V(s_t)]$, and $\Delta(s,a) := \hat Q(s,a) - Q(s,a)$):
$$
\begin{aligned}
\mathbb{V}_t\big[V_{DR}^{H+1-t}\big]
&= \mathbb{E}_t\Big[\big(V_{DR}^{H+1-t}\big)^2\Big] - \big(\mathbb{E}_t[V(s_t)]\big)^2 \\
&= \mathbb{E}_t\Big[\big(\hat V(s_t) + \rho_t\big(r_t + \gamma V_{DR}^{H-t} - \hat Q(s_t,a_t)\big)\big)^2 - V(s_t)^2\Big] + \mathbb{V}_t\big[V(s_t)\big] \\
&= \mathbb{E}_t\Big[\big(\rho_t Q(s_t,a_t) - \rho_t Q(s_t,a_t) + \hat V(s_t) + \rho_t\big(r_t + \gamma V_{DR}^{H-t} - \hat Q(s_t,a_t)\big)\big)^2 - V(s_t)^2\Big] + \mathbb{V}_t\big[V(s_t)\big] \\
&= \mathbb{E}_t\Big[\big(-\rho_t \Delta(s_t,a_t) + \hat V(s_t) + \rho_t\big(r_t - R(s_t,a_t)\big) + \rho_t\gamma\big(V_{DR}^{H-t} - \mathbb{E}_{t+1}[V(s_{t+1})]\big)\big)^2 - V(s_t)^2\Big] + \mathbb{V}_t\big[V(s_t)\big] \quad (15) \\
&= \mathbb{E}_t\Big[\mathbb{E}_t\big[\big(-\rho_t \Delta(s_t,a_t) + \hat V(s_t)\big)^2 - V(s_t)^2 \,\big|\, s_t\big]\Big]
  + \mathbb{E}_t\Big[\mathbb{E}_{t+1}\big[\rho_t^2\big(r_t - R(s_t,a_t)\big)^2\big]\Big]
  + \mathbb{E}_t\Big[\mathbb{E}_{t+1}\big[\rho_t^2\gamma^2\big(V_{DR}^{H-t} - \mathbb{E}_{t+1}[V(s_{t+1})]\big)^2\big]\Big]
  + \mathbb{V}_t\big[V(s_t)\big] \\
&= \mathbb{E}_t\Big[\mathbb{V}_t\big[-\rho_t \Delta(s_t,a_t) + \hat V(s_t) \,\big|\, s_t\big]\Big]
  + \mathbb{E}_t\big[\rho_t^2\,\mathbb{V}_{t+1}[r_t]\big]
  + \mathbb{E}_t\big[\rho_t^2\gamma^2\,\mathbb{V}\big[V_{DR}^{H-t} \,\big|\, s_t,a_t\big]\big]
  + \mathbb{V}_t\big[V(s_t)\big] \\
&= \mathbb{E}_t\Big[\mathbb{V}_t\big[\rho_t \Delta(s_t,a_t) \,\big|\, s_t\big]\Big]
  + \mathbb{E}_t\big[\rho_t^2\,\mathbb{V}_{t+1}[r_t]\big]
  + \mathbb{E}_t\big[\rho_t^2\gamma^2\,\mathbb{V}_{t+1}\big[V_{DR}^{H-t}\big]\big]
  + \mathbb{V}_t\big[V(s_t)\big].
\end{aligned}
$$
This completes the proof. Note that from Eqn.(15) to the next step, we have used the fact that conditioned on $s_t$ and $a_t$, …
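To make the recursion that the derivation expands concrete, here is a minimal numerical sketch, assuming a hypothetical toy MDP: it implements the step-wise doubly robust estimator $V_{DR}^{H+1-t} = \hat V(s_t) + \rho_t\,(r_t + \gamma V_{DR}^{H-t} - \hat Q(s_t,a_t))$ with $V_{DR}^{0}=0$ and runs a rough Monte Carlo check of the unbiasedness used in the first equality above. The MDP, its numbers, and the helper names (true_q_v, dr_estimate) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP (2 states, 2 actions, horizon H); all numbers are
# made up for illustration only.
H, gamma = 4, 1.0
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s']
              [[0.6, 0.4], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],                    # R[s, a] = E[r_t | s_t = s, a_t = a]
              [0.2, 0.8]])
mu = np.array([[0.5, 0.5], [0.5, 0.5]])      # behavior policy mu(a | s)
pi = np.array([[0.9, 0.1], [0.2, 0.8]])      # target policy  pi(a | s)

def true_q_v(policy):
    """Finite-horizon Q and V of `policy` by backward induction (t = 1..H)."""
    Q = np.zeros((H + 2, 2, 2))
    V = np.zeros((H + 2, 2))
    for t in range(H, 0, -1):
        Q[t] = R + gamma * (P @ V[t + 1])
        V[t] = (policy * Q[t]).sum(axis=1)
    return Q, V

Q_pi, V_pi = true_q_v(pi)

# A deliberately inaccurate model (hat Q); hat V(s) = sum_a pi(a|s) hat Q(s, a).
Q_hat = Q_pi + rng.normal(0.0, 0.3, size=Q_pi.shape)
V_hat = (pi[None] * Q_hat).sum(axis=2)

def dr_estimate(s0, Q_model, V_model):
    """Sample one trajectory under mu, then apply the step-wise DR recursion
    V_DR^{H+1-t} = V_model(s_t) + rho_t (r_t + gamma V_DR^{H-t} - Q_model(s_t, a_t))."""
    traj, s = [], s0
    for t in range(1, H + 1):
        a = rng.choice(2, p=mu[s])
        r = R[s, a] + rng.normal(0.0, 0.1)   # noisy reward with mean R[s, a]
        traj.append((t, s, a, r))
        s = rng.choice(2, p=P[s, a])
    v_dr = 0.0                                # base case: V_DR^0 = 0
    for t, s, a, r in reversed(traj):
        rho = pi[s, a] / mu[s, a]
        v_dr = V_model[t, s] + rho * (r + gamma * v_dr - Q_model[t, s, a])
    return v_dr

est = np.array([dr_estimate(0, Q_hat, V_hat) for _ in range(50_000)])
print("true V^pi at (t=1, s=0):", V_pi[1, 0])
print("mean of DR estimates:   ", est.mean())   # should agree up to Monte Carlo error
```

In this sketch $\hat V$ is derived from $\hat Q$ under the target policy, $\hat V(s)=\sum_a \pi(a|s)\,\hat Q(s,a)$; that choice is what makes $\mathbb{E}[\hat V(s_t) - \rho_t\Delta(s_t,a_t)\,|\,s_t] = V(s_t)$, the fact used when the squared term in Eqn.(15) is turned into a conditional variance.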
Similar Papers
Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
We study the problem of off-policy value evaluation in reinforcement learning (RL), where one aims to estimate the value of a new policy based on data collected by a different policy. This problem is often a critical step when applying RL to real-world problems. Despite its importance, existing general methods either have uncontrolled bias or suffer high variance. In this work, we extend the do...
Doubly Robust Off-policy Evaluation for Reinforcement Learning
We study the problem of evaluating a policy that is different from the one that generates data. Such a problem, known as off-policy evaluation in reinforcement learning (RL), is encountered whenever one wants to estimate the value of a new solution, based on historical data, before actually deploying it in the real system, which is a critical step of applying RL in most real-world applications....
More Robust Doubly Robust Off-policy Evaluation
We study the problem of off-policy evaluation (OPE) in reinforcement learning (RL), where the goal is to estimate the performance of a policy from the data generated by another policy(ies). In particular, we focus on the doubly robust (DR) estimators that consist of an importance sampling (IS) component and a performance model, and utilize the low (or zero) bias of IS and low variance of the mo...
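The trade-off described above, the low (or zero) bias of the IS component combined with the low variance of the model, can be seen directly in the hypothetical toy example sketched after the proof: setting the model to zero ($\hat Q \equiv 0$, $\hat V \equiv 0$) reduces the DR recursion to per-decision (step-wise) importance sampling, so comparing the empirical variances of the two isolates what the model term contributes. The snippet below continues that sketch, reusing dr_estimate, Q_hat, and V_hat defined there; it is an illustration, not code from any of the cited papers.

```python
# Continuing the toy example above: with the model zeroed out, the DR recursion
# is exactly per-decision importance sampling (IS). Both are unbiased here, but
# DR typically has lower variance when hat Q is reasonably accurate.
Q_zero, V_zero = np.zeros_like(Q_hat), np.zeros_like(V_hat)

is_est = np.array([dr_estimate(0, Q_zero, V_zero) for _ in range(50_000)])
dr_est = np.array([dr_estimate(0, Q_hat, V_hat) for _ in range(50_000)])

print(f"IS: mean {is_est.mean():.3f}  variance {is_est.var():.3f}")
print(f"DR: mean {dr_est.mean():.3f}  variance {dr_est.var():.3f}")
```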
Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
In this paper we present a new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy. The ability to evaluate a policy from historical data is important for applications where the deployment of a bad policy can be dangerous or costly. We show empirically that our algorithm produces estimates that often have ...
Doubly Robust Policy Evaluation and Learning
We study decision making in environments where the reward is only partially observed, but can be modeled as a function of an action and an observed context. This setting, known as contextual bandits, encompasses a wide variety of applications including health-care policy and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions an...
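For the one-step contextual-bandit setting described above, the doubly robust estimator adds an importance-weighted correction on the logged action to a direct estimate from a reward model. Below is a minimal sketch of that standard construction, assuming synthetic logged data and made-up names (dr_bandit_value, target_policy, reward_model, true_mean_reward); it is not code or data from the cited paper.

```python
import numpy as np

def dr_bandit_value(x, a_logged, r_logged, mu_logged, pi_probs_fn, r_hat_fn, n_actions):
    """Doubly robust estimate of a target policy's value from logged bandit data.

    x            : contexts, shape (n, d)
    a_logged     : actions chosen by the behavior (logging) policy, shape (n,)
    r_logged     : rewards observed for those actions, shape (n,)
    mu_logged    : behavior propensities mu(a_logged | x), shape (n,)
    pi_probs_fn  : x -> target policy action probabilities, shape (n, n_actions)
    r_hat_fn     : fitted reward model, (x, a) -> predicted rewards, shape (n,)
    """
    n = len(x)
    pi_probs = pi_probs_fn(x)
    # Direct-method term: modeled reward averaged over the target policy's actions.
    dm = sum(pi_probs[:, a] * r_hat_fn(x, np.full(n, a)) for a in range(n_actions))
    # Importance-weighted correction, applied only at the logged action.
    rho = pi_probs[np.arange(n), a_logged] / mu_logged
    return float(np.mean(dm + rho * (r_logged - r_hat_fn(x, a_logged))))

# Hypothetical usage on synthetic logged data (uniform behavior policy).
rng = np.random.default_rng(1)
n, n_actions = 20_000, 2
x = rng.normal(size=(n, 3))

def true_mean_reward(x, a):            # assumed ground truth, used only to simulate logs
    return 1.0 / (1.0 + np.exp(-(x[:, 0] + a)))

a_logged = rng.integers(0, n_actions, size=n)
r_logged = rng.binomial(1, true_mean_reward(x, a_logged)).astype(float)
mu_logged = np.full(n, 1.0 / n_actions)

def target_policy(x):                  # the new policy to be evaluated
    return np.tile([0.2, 0.8], (len(x), 1))

def reward_model(x, a):                # deliberately biased model; the correction fixes it
    return true_mean_reward(x, a) + 0.1

print(dr_bandit_value(x, a_logged, r_logged, mu_logged,
                      target_policy, reward_model, n_actions))
```

The estimate remains consistent if either the logged propensities or the reward model is accurate, which is the "doubly robust" property the title refers to.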
Eligibility Traces for Off-Policy Policy Evaluation
Eligibility traces have been shown to speed reinforcement learning, to make it more robust to hidden states, and to provide a link between Monte Carlo and temporal-difference methods. Here we generalize eligibility traces to off-policy learning, in which one learns about a policy different from the policy that generates the data. Off-policy methods can greatly multiply learning, as many policie...